Checking installation and loading packages

As usual, we first check for and load our required packages.

# Check if packages are installed, if not install.
if(!require(here)) install.packages('here') #checks if a package is installed and installs it if required.
if(!require(tidyverse)) install.packages('tidyverse')
if(!require(ggplot2)) install.packages('ggplot2')

library(here) #loads in the specified package
library(tidyverse)
library(ggplot2)

Sampling distribution of the mean

This week we will be learning a lot about the sampling distribution of the mean.

In last week's lab you were introduced to a dataset looking at social media use in young adults. That data comes from a research programme run here at UNSW. However, this experiment has been repeated at 500 universities across the world to get an in-depth global understanding of social media use in young adults. Today you are going to look at your dataset from the last computing lab, and at the data from across the 500 universities.

Checking the mean of time_on_social from last week

First we want to have another look at the PSYC2001_social-media-data.csv dataset from last week. To do this we first load it in using the same read.csv() function combined with here().

social_media <- read.csv(file = here("Data","PSYC2001_social-media-data.csv")) #reads in CSV files

Next we want to look at the mean of the variable time_on_social. We are going to do this in the same way we did in the last tutorial, by first altering all instances of -999 to become NA and then using the summary() function.

social_media_NA <- social_media %>%
  mutate(time_on_social = na_if(time_on_social,-999)) #mutate creates or alters columns.
                                                      #na_if replaces -999 with NA.

summary(social_media_NA) #provides a summary of all variables in the data. 
##       id                 age        time_on_social      urban    
##  Length:60          Min.   :13.90   Min.   :1.240   Min.   :1.0  
##  Class :character   1st Qu.:15.70   1st Qu.:2.010   1st Qu.:1.0  
##  Mode  :character   Median :16.50   Median :2.410   Median :1.5  
##                     Mean   :16.87   Mean   :2.539   Mean   :1.5  
##                     3rd Qu.:17.43   3rd Qu.:3.047   3rd Qu.:2.0  
##                     Max.   :23.00   Max.   :4.320   Max.   :2.0  
##                                     NA's   :2                    
##  good_mood_likes bad_mood_likes    followers      polit_informed 
##  Min.   : 6.50   Min.   :12.20   Min.   : 61.40   Min.   :0.600  
##  1st Qu.:31.60   1st Qu.:39.08   1st Qu.: 76.47   1st Qu.:1.500  
##  Median :45.90   Median :49.30   Median :116.30   Median :1.800  
##  Mean   :43.04   Mean   :49.84   Mean   :124.76   Mean   :1.858  
##  3rd Qu.:53.40   3rd Qu.:58.75   3rd Qu.:153.75   3rd Qu.:2.200  
##  Max.   :89.20   Max.   :91.20   Max.   :336.50   Max.   :3.400  
##                                                                  
##  polit_campaign  polit_activism 
##  Min.   :0.800   Min.   :0.900  
##  1st Qu.:2.100   1st Qu.:2.400  
##  Median :2.550   Median :2.900  
##  Mean   :2.602   Mean   :2.977  
##  3rd Qu.:3.100   3rd Qu.:3.500  
##  Max.   :4.800   Max.   :5.500  
## 
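
If you only need a single mean rather than the full summary, a minimal sketch like the one below may help. It uses a small made-up vector (toy_times is hypothetical, not the real data) and base R recoding instead of na_if(), and it shows why the -999 code has to be handled before averaging.

```r
# Hypothetical toy values; -999 marks a missing response, as in the real file
toy_times <- c(2.0, 3.0, -999, 4.0)

# Averaging without recoding treats -999 as a real score
mean(toy_times) # -247.5

# Recode -999 to NA (a base R equivalent of na_if())
toy_times[toy_times == -999] <- NA

# mean() returns NA if any value is missing ...
mean(toy_times) # NA

# ... unless we ask it to remove the NAs first
toy_mean <- mean(toy_times, na.rm = TRUE)
toy_mean # 3
```
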
Question: What is the mean of time_on_social ?
Figure 1: Deja Vu

Now we are going to start looking at the new dataset for this week.

Info: All the information about this data and its variables is located in the ReadME_What is the sampling distribution of the mean anyway?.txt file. If you have NOT read this yet, please make sure you do!

Activity 1 - Reading in Data

We are now going to read the dataset that we need for this week into R. The dataset can be read into an object called global_social_media. Please use the read.csv() and here() functions to read in the PSYC2001_global-time-on-social-data.csv file in the code block below.

#Use the read.csv() and here() functions to read in the dataset.

global_social_media <- read.csv(file = here("Data","PSYC2001_global-time-on-social-data.csv")) #your code goes here

Checking the UNSW value in this dataset.

Now, let's check whether the UNSW value (reminder: this is U49 from the ReadME file!) matches the mean value we found above in last week's dataset.

This should be pretty easy to do, and makes use of the filter() function we used last week. This function is able to filter rows in your dataset that match a certain condition.

global_social_media %>% 
  filter(uni_id == "U49") # reminder filter is used to select rows based on given conditions
##   uni_id mean_time_on_social
## 1    U49                2.54

Yay! The output should match the mean we calculated from last week's data (2.539, which rounds to 2.54). But what does this mean? Why have we bothered to show you this?

Each value in the new data file is the mean value for the time_on_social variable, for each of the 500 experiments run (U1-U500), i.e. the data contains 500 samples of the mean for time_on_social. This dataset is the result of repeating a single experiment, many times.

Activity 2 - Finding the University of Sydney

Can you find the value for the University of Sydney (reminder: this is ‘U102’ from the ReadME file!) using the filter() function?

global_social_media %>% 
  filter(uni_id == "U102")
##   uni_id mean_time_on_social
## 1   U102                2.68
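
If you ever want to look up more than one university at once, one possible approach is to combine filter() with the %in% operator. The sketch below uses a small made-up dataframe (toy_global and its values are hypothetical stand-ins for global_social_media).

```r
library(dplyr)

# Hypothetical stand-in for global_social_media (values are made up)
toy_global <- data.frame(
  uni_id = c("U1", "U2", "U3", "U4"),
  mean_time_on_social = c(2.10, 2.95, 2.40, 3.20)
)

# %in% keeps rows whose uni_id matches any value in the vector
two_unis <- toy_global %>%
  filter(uni_id %in% c("U2", "U4"))

two_unis
```
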

Vectors and the sample function

We are now going to have a look at what happens to the sampling distribution of the mean as we increase the number of sample means we draw from it (confusing, I know!).

To do this we are going to make use of the sample() function from base R. Let's first have a look at what this function does by using the ? syntax. The help page should have already opened in a webpage when you first knitted the document.

?sample

This function takes a vector as its first argument. This means we cannot just give it the entire dataframe, as it does not know what to do with it. Doing so results in an error.

sample(global_social_media, size = 10)
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

But you may be asking now, what is a vector? You can think of a vector as a single column in our dataframe. In the example above, sample() treated the dataframe as a collection of just 2 items (its 2 columns), so asking for 10 of them without replacement failed. Poor function!
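
To see why the error message talks about a sample "larger than the population", it may help to check how R measures a dataframe. This is a minimal sketch on a made-up dataframe (toy_df is hypothetical), with tryCatch() used only so the error is captured as text instead of stopping the script.

```r
# Hypothetical two-column dataframe (made-up values)
toy_df <- data.frame(
  uni_id = c("U1", "U2", "U3"),
  mean_time_on_social = c(2.1, 2.9, 2.4)
)

is.data.frame(toy_df) # TRUE
is.vector(toy_df)     # FALSE: a dataframe is not a plain vector

# length() of a dataframe counts its columns, not its rows
length(toy_df)        # 2

# Asking for more items than there are columns reproduces the error
result <- tryCatch(
  sample(toy_df, size = 10),
  error = function(e) conditionMessage(e)
)
result
```
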

Figure 2: Specter on a Vector

So we need to pass it only a single column. We can do this by using the $ operator from base R. In essence, the $ operator extracts a specified column from our dataframe.

Let's have a go at using it below:

head(global_social_media$uni_id) #head displays the first 6 elements. $ has been used to extract only the column uni_id
## [1] "U1" "U2" "U3" "U4" "U5" "U6"
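
As a quick check that $ really does hand back a plain vector, the sketch below compares it with the equivalent [[ ]] syntax on a made-up dataframe (toy_df is hypothetical).

```r
# Hypothetical two-column dataframe (made-up values)
toy_df <- data.frame(
  uni_id = c("U1", "U2", "U3"),
  mean_time_on_social = c(2.1, 2.9, 2.4)
)

by_dollar  <- toy_df$mean_time_on_social      # extract with $
by_bracket <- toy_df[["mean_time_on_social"]] # extract with [[ ]]

identical(by_dollar, by_bracket) # TRUE: both give the same vector
is.vector(by_dollar)             # TRUE
length(by_dollar)                # 3: length now counts values, not columns
```
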

What does the sampling distribution of the mean look like with only a few samples ?

Great! So we now know how to pass a vector (column) into the sample() function. First let's try extracting a low number of samples, say 20.

set.seed(1) #ensures we always get the same result from sampling ! 

mean_sample_20 <- sample(global_social_media$mean_time_on_social, size = 20) #randomly samples 20 data points from mean_time_on_social

mean_sample_20
##  [1] 3.08 1.56 1.82 2.31 3.07 3.48 2.42 2.90 1.90 2.45 3.07 2.31 2.00 2.02 2.31
## [16] 1.75 2.36 3.21 2.35 2.23
Info: You might notice we use the set.seed() function. The purpose of this is to create reproducible results from randomness. You can go here if you want to know more

This has extracted a sample of 20 means randomly from the global_social_media column mean_time_on_social.
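
To see what set.seed() is doing, here is a small sketch on made-up numbers (the vector values and the seed 1 are arbitrary choices). Resetting the seed before sampling repeats the same draw, while sampling again without resetting carries on down the random stream.

```r
values <- 1:100 # arbitrary toy values

set.seed(1)
first_draw <- sample(values, size = 5)

set.seed(1) # reset to the same seed
second_draw <- sample(values, size = 5)

third_draw <- sample(values, size = 5) # no reset, so this draw continues the stream

identical(first_draw, second_draw) # TRUE
third_draw
```
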

Now, what we want to do next is to visualise this data using a histogram. However, there is a problem. The sample() function provides us with a single vector (column), but our plotting function ggplot() only accepts dataframes. So we first need to convert our mean_sample_20 vector into a dataframe.

To do this we use the data.frame() function from base R to create a dataframe.

mean_sample_20_df <- data.frame(sample_20 = mean_sample_20) #create a dataframe with a column called sample_20 that takes values from our vector

head(mean_sample_20_df)
##   sample_20
## 1      3.08
## 2      1.56
## 3      1.82
## 4      2.31
## 5      3.07
## 6      3.48

Now let's do some data visualisation! We can create a histogram using the ggplot() and geom_histogram() functions from last week. We will also be using a new function called labs() to label our x and y axes.

mean_sample_20_df %>% 
  ggplot(aes(x = sample_20)) +
  geom_histogram(fill = "skyblue", colour = "black") +  #fill and colour are Aesthetics. Fill controls the interior colour of shapes whereas colour controls the outline. 
  labs(x = "Time on Social media", y = "Count") #short for "labels", used to label axes and titles.

Info: We are using some new aesthetics this week. We use the fill “skyblue” and the colour “black” to control the interior colour and border of our histogram respectively. You will learn more aesthetics that can be used to create nicer looking plots each week
Question: What is the shape of the histogram here ? Is it as you expected ?
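
When you run geom_histogram() without extra arguments, it uses 30 bins by default and prints a message suggesting you pick a better value. With only 20 data points, most of those bins hold zero or one value. The sketch below sets the number of bins explicitly; it uses simulated stand-in data (toy_20_df is hypothetical), and bins = 8 is an arbitrary choice for illustration.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for mean_sample_20_df (simulated, not the real data)
toy_20_df <- data.frame(sample_20 = rnorm(20, mean = 2.5, sd = 0.5))

toy_hist <- ggplot(toy_20_df, aes(x = sample_20)) +
  geom_histogram(bins = 8, fill = "skyblue", colour = "black") + #fewer, wider bins
  labs(x = "Time on Social media", y = "Count")

toy_hist
```
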

What does the sampling distribution of the mean look like as we add more samples?

Now let's see what happens when we add in more samples.

Activity 3 - Increasing the sample size

Are you able to use the sample() and data.frame() functions to create objects with 100, 250, 350 and 500 samples? Use the code blocks below to do this. If you need any help, please ask your tutor!

Hint: This is just replicating what we have done above with some new object names.
#fill in the code below !

mean_sample_100 <- sample(global_social_media$mean_time_on_social, size = 100) #create a sampling distribution of the mean with 100 samples
  
mean_sample_250 <- sample(global_social_media$mean_time_on_social, size = 250) #create a sampling distribution of the mean with 250 samples
  
mean_sample_350 <- sample(global_social_media$mean_time_on_social, size = 350) #create a sampling distribution of the mean with 350 samples
  
mean_sample_500 <- sample(global_social_media$mean_time_on_social, size = 500) #create a sampling distribution of the mean with 500 samples
#fill in the code below ! 
mean_sample_100_df <- data.frame(sample_100 = mean_sample_100) #create a dataframe with a column called sample_100 that takes values from our vector

mean_sample_250_df <- data.frame(sample_250 = mean_sample_250) #create a dataframe with a column called sample_250 that takes values from our vector

mean_sample_350_df <- data.frame(sample_350 = mean_sample_350) #create a dataframe with a column called sample_350 that takes values from our vector

mean_sample_500_df <- data.frame(sample_500 = mean_sample_500) #create a dataframe with a column called sample_500 that takes values from our vector

Well done! This was a hard activity. If you are struggling, please ask your tutor for help.

Figure 3: Ask for help !

Activity 4 - Visualising the sampling distribution of the mean with increasing samples

Next we need to visualise all of these new samples. It is important we do this so we can see what happens to our sampling distribution of the mean as we increase the number of mean samples.

We can do this by repeating the code for each histogram and giving it a new fill colour! Are you able to help with this? (Note: there are, of course, much cleaner ways to do this. If you are interested, please see the guide here.)

mean_sample_100_df %>% 
  ggplot(aes(x = sample_100)) +
  geom_histogram(fill = "red", colour = "black") +
  labs(x = "Time on Social media", y = "Count")

mean_sample_250_df %>% 
  ggplot(aes(x = sample_250)) +
  geom_histogram(fill = "blue", colour = "black") +
  labs(x = "Time on Social media", y = "Count")

mean_sample_350_df %>% 
  ggplot(aes(x = sample_350)) +
  geom_histogram(fill = "green", colour = "black") +
  labs(x = "Time on Social media", y = "Count")

mean_sample_500_df %>% 
  ggplot(aes(x = sample_500)) +
  geom_histogram(fill = "orange", colour = "black") +
  labs(x = "Time on Social media", y = "Count")
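
One possible cleaner approach (a sketch only, not necessarily the one the linked guide takes) is to stack all the samples into a single long dataframe and let facet_wrap() draw one panel per sample count. The data here are simulated stand-ins (toy_means is hypothetical), and the column names n_means and sample_mean are made up for illustration.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for global_social_media$mean_time_on_social
toy_means <- rnorm(500, mean = 2.5, sd = 0.5)

sizes <- c(20, 100, 250, 350, 500)

# Draw one sample per size and stack the results into one long dataframe
stacked_df <- do.call(rbind, lapply(sizes, function(n) {
  data.frame(n_means = n, sample_mean = sample(toy_means, size = n))
}))

stacked_plot <- ggplot(stacked_df, aes(x = sample_mean)) +
  geom_histogram(bins = 30, fill = "skyblue", colour = "black") +
  facet_wrap(~ n_means, scales = "free_y") + #one panel per number of sample means
  labs(x = "Time on Social media", y = "Count")

stacked_plot
```

Letting the y axis vary between panels (scales = "free_y") keeps the small samples readable next to the large ones.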

Question: How did the histogram change? Is it what you expected? Discuss this with your neighbours and your tutor.

Now imagine that there were infinitely many universities that ran this experiment, that each collected a group of people’s time_on_social scores, and that each gave us their sample mean value. That is the theoretical sampling distribution of the mean.

The change in the shape of the histogram we have observed here illustrates a critical implication of the central limit theorem: when each sample is reasonably large, the sampling distribution of the mean is approximately normal regardless of the actual population distribution, and the more sample means we plot, the more clearly that shape appears.

Question: Now that's all well and good, but what does this actually mean? Why do you think this actually matters for the statistics we do? Discuss this with your neighbour and tutors.

Extension - What happens to the sampling distribution of the mean for other population distributions?

This section is an extension activity if you have already finished the required materials. Please check with your tutor that you have a good grasp of the material before moving onto this section.

Figure 4: Extension students be like

Now let's get into it. Let's see what happens when we use an exponential population distribution and find its sampling distribution of the mean with a large number of samples.

First, we are going to generate the population distribution. This can be done easily using the function rexp(), which generates random values from an exponential distribution.

# Generate an exponential population distribution
population <- data.frame(value = rexp(100000, rate = 1)) #generate 100,000 random values from an exponential distribution.

Next let's generate a histogram of this population distribution and confirm that it looks like an exponential distribution.

# Plot population distribution
ggplot(population, aes(x = value)) +
  geom_histogram(bins = 100, fill = "skyblue", color = "black") +
  labs(
       x = "Value",
       y = "Frequency") +
  theme_classic() #themes can be provided to ggplot which give it a bunch of aesthetics to change. One of these is theme_classic

Question: Do you think this looks like an exponential distribution ? What should an exponential distribution look like ? Ask your tutor if you are not sure !
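
Once you have had a go at the question, a numeric check like the sketch below may help. It compares simulated values with what theory says about an exponential distribution with rate = 1; the object names (density_at_x, toy_population) are made up for illustration.

```r
# Theoretical density of an exponential distribution with rate = 1 is exp(-x):
# it is highest at 0 and falls away quickly as x grows
x <- c(0, 1, 2, 3)
density_at_x <- dexp(x, rate = 1)
round(density_at_x, 3) # 1.000 0.368 0.135 0.050

set.seed(1)
toy_population <- rexp(100000, rate = 1)

mean(toy_population)   #should be close to 1 / rate = 1
median(toy_population) #should be close to log(2) / rate, about 0.69
```
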

Now we can take samples from this population. We use a new function here called replicate(), which repeats the process inside the {} brackets a specified number of times. Inside the {} brackets is what we are actually doing to the population distribution. First we take a sample using the sample() function, then we take the mean of that sample using the mean() function. This will generate a similar set of data to PSYC2001_global-time-on-social-data.csv. That is, the means of a bunch of samples from the population (i.e. the sampling distribution of the mean).
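
To see replicate() in isolation before applying it to the population, here is a toy sketch on made-up numbers (the values and object names are arbitrary). A fixed expression gives the same answer on every repetition, whereas an expression that samples is re-run fresh each time.

```r
# A fixed expression gives the same answer every time
fixed_result <- replicate(3, mean(c(1, 2, 3)))
fixed_result # 2 2 2

# A random expression is re-run on each repetition
set.seed(1)
random_result <- replicate(4, {
  draws <- sample(1:10, size = 3) #take 3 values at random
  mean(draws)                     #the last line is what gets returned
})

length(random_result) # 4: one mean per repetition
random_result
```
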

# Let's take 500 samples of size 50 from this population and calculate the mean of each.
sample_means <- replicate(500, { #replicate the process 500 times
  sample_values <- sample(population$value, size = 50, replace = TRUE) #sample 50 values from the population
  mean(sample_values) #take the mean of those 50 sampled values
})

So what we have generated is a sampling distribution of the mean with 500 samples. Let's first convert that into a dataframe so that we can use ggplot to visualise this data.

# Put results in a dataframe for plotting
sampling_df <- data.frame(sample_mean = sample_means) #convert the result into a dataframe

#plot the results
sampling_df %>% 
  ggplot(aes(x = sample_mean)) +
  geom_histogram(fill = "skyblue", color = "black") +
  labs(x = "Sample Mean", y = "Frequency") +
  theme_classic()

Question: What is the shape of this distribution? Is it different to the population distribution from above? What are the implications of this?
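
After discussing the question, you can also check the sampling distribution numerically. For samples of size n, theory says the sample means should centre on the population mean, with a standard deviation (the standard error) of roughly the population standard deviation divided by the square root of n. The sketch below regenerates its own data with a fixed seed, so the object names (population_values, check_means) are made up and the numbers will not match your knitted output exactly.

```r
set.seed(1)

# Exponential population with rate = 1: mean 1 and standard deviation 1
population_values <- rexp(100000, rate = 1)

n_per_sample <- 50
check_means <- replicate(500, {
  mean(sample(population_values, size = n_per_sample, replace = TRUE))
})

mean(check_means) #should be close to the population mean of 1
sd(check_means)   #should be close to the standard error below
sd(population_values) / sqrt(n_per_sample) #about 0.14
```
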

Well done, you have completed another computing tutorial. This one has been very difficult and you have done a terrific job. See you all next week for more computing fun!

Figure 5: Celebrating finishing this tutorial !
